
Journal of Open Source Software

The Open Journal

Preprints posted in the last 30 days, ranked by how well they match Journal of Open Source Software's content profile, based on 22 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit.

1
TRaP: An Open-source, Reproducible Framework for Raman Spectral Preprocessing across Heterogeneous Systems

Zhu, Y.; Lionts, M. M.; Haugen, E.; Walter, A. B.; Voss, T. R.; Grow, G. R.; Liao, R.; McKee, M. E.; Locke, A.; Hiremath, G.; Mahadevan-Jansen, A.; Huo, Y.

2026-03-27 bioengineering 10.64898/2026.03.26.714582 medRxiv
Top 0.1%
17.4%

Raman spectroscopy offers a uniquely rich window into molecular structure and composition, making it a powerful tool across fields ranging from materials science to biology. However, the reproducibility of Raman data analysis remains a fundamental bottleneck. In practice, transforming raw spectra into meaningful results is far from standardized: workflows are often complex, fragmented, and implemented through highly customized, case-specific code. This challenge is compounded by the lack of unified open-source pipelines and the diversity of acquisition systems, each introducing its own file formats, calibration schemes, and correction requirements. Consequently, researchers must frequently rely on manual, ad hoc reconciliation of processing steps. To address this gap, we introduce TRaP (Toolbox for Reproducible Raman Processing), an open-source, GUI-based Python toolkit designed to bring reproducibility, transparency, and portability to Raman spectral analysis. TRaP unifies the entire preprocessing-to-analysis pipeline within a single, coherent framework that operates consistently across heterogeneous instrument platforms (e.g., Cart, Portable, Renishaw, and MANTIS). Central to its design is the concept of fully shareable, declarative workflows: users can encode complete processing pipelines into a single configuration file (e.g., JSON), enabling others to reproduce results instantly without reimplementing code or reverse-engineering undocumented steps. Beyond convenience, TRaP integrates configuration management, X-axis calibration, spectral response correction, interactive processing, and batch execution into a workflow-driven architecture that enforces deterministic, repeatable operations. Every transformation is explicitly recorded, making the full processing history transparent, inspectable, and reproducible. 
This eliminates ambiguity in how results are generated and ensures that identical protocols can be applied consistently across datasets and experimental contexts. Through representative use cases, we show that TRaP enables seamless, reproducible preprocessing of Raman spectra acquired from diverse platforms within a unified environment. We hope TRaP can empower Raman data processing as a reproducible, shareable, and systematized scientific practice, aligning it with modern standards for computational research. TRaP is released as open-source software at https://github.com/hrlblab/TRaP
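The declarative-workflow idea above, a single JSON file that fully determines the processing, can be sketched in a few lines of Python. The step names, arguments, and config layout below are illustrative only, not TRaP's actual schema:

```python
import json

# Hypothetical step registry -- step names and parameters are made up
# for illustration, not TRaP's actual API.
def crop(spec, lo, hi):
    """Keep only points whose wavenumber lies in [lo, hi]."""
    return [(x, y) for x, y in spec if lo <= x <= hi]

def normalize(spec):
    """Scale intensities so the maximum becomes 1.0."""
    m = max(y for _, y in spec)
    return [(x, y / m) for x, y in spec]

STEPS = {"crop": crop, "normalize": normalize}

def run_pipeline(spectrum, config):
    """Apply each declared step in order, so the JSON file alone
    is enough to reproduce the result deterministically."""
    for step in config["pipeline"]:
        spectrum = STEPS[step["name"]](spectrum, *step.get("args", []))
    return spectrum

# The shareable artifact: a plain JSON document describing the pipeline.
config = json.loads("""
{"pipeline": [
    {"name": "crop", "args": [400, 1800]},
    {"name": "normalize"}
]}
""")

raw = [(200, 5.0), (800, 10.0), (1600, 2.5), (2500, 1.0)]
print(run_pipeline(raw, config))
```

Because every transformation is named and parameterized in the config, a collaborator can re-run the identical pipeline without reverse-engineering any code, which is the reproducibility property the abstract emphasizes.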

2
Track Hub Quickload Translator: Convert Track Hub or Quickload data for viewing in the UCSC Genome Browser or the Integrated Genome Browser

Freese, N. H.; Raveendran, K.; Sirigineedi, J. S.; Chinta, U. L.; Badzuh, P.; Marne, O.; Shetty, C.; Naylor, I.; Jagarapu, S.; Loraine, A.

2026-03-30 bioinformatics 10.64898/2026.03.26.708838 medRxiv
Top 0.1%
3.7%

Summary: Track Hub Quickload Translator is a web application that interconverts University of California Santa Cruz (UCSC) Genome Browser track hub and Integrated Genome Browser (IGB) data repository formats by translating the track hub or Quickload configuration files into the other genome browser's required format. This new work enables researchers to work with tens of thousands of published genome assemblies for the first time using either browser. Availability and Implementation: Track Hub Quickload Translator is implemented in Python 3 and freely available at translate.bioviz.org. Integrated Genome Browser is available from BioViz.org. The Track Hub Quickload Translator, GenArk Genomes, and Integrated Genome Browser source code is available from github.org/lorainelab. Contact: aloraine@charlotte.edu

3
Correlate: A Web Application for Analyzing Gene Sets and Exploring Gene Dependencies Using CRISPR Screen Data

Deolankar, S.; Wermeling, F.

2026-04-04 bioinformatics 10.64898/2026.04.02.716070 medRxiv
Top 0.1%
1.8%

CRISPR screen data provides a valuable resource for understanding gene function and identifying potential drug targets. Here, we present Correlate, a freely accessible web application (https://correlate.cmm.se) that enables exploration of the Cancer Dependency Map (DepMap) CRISPR screen gene effects, hotspot mutations, and translocation/fusion data across more than 1,000 human cancer cell lines. The application supports two main use cases: (i) analysis of user-defined gene sets (e.g. CRISPR screen hits) to identify functionally linked genes based on correlations while providing an overview based on essentiality or user-provided screen statistics; and (ii) exploration of genes of interest in defined biological contexts, such as specific cancer types or mutational backgrounds, to generate hypotheses about gene function and dependencies. Additionally, Correlate supports experimental design by providing rapid overviews of gene essentiality and enabling the identification of cell lines with relevant mutational profiles. In contrast to knowledge-based approaches such as STRING and GSEA, which rely on prior biological annotations and curated interaction networks, Correlate identifies gene connections directly from functional CRISPR screen readouts, offering a complementary and data-driven perspective on gene network analysis. The application runs entirely in the browser, requires no installation or login, and integrates with the Green Listed v2.0 tool family for custom CRISPR screen design.

Highlights:
- Interactive web-based platform for bulk correlation analysis of user-defined gene sets using DepMap CRISPR screen data, requiring no installation or programming expertise.
- Identifies functional gene relationships from CRISPR screen readouts rather than curated annotations, offering a data-driven complement to tools such as GSEA and STRING.
- Enables contextual exploration of gene dependencies across cancer types and mutational backgrounds, supporting hypothesis generation about gene function and therapeutic targets.
- Supports experimental design through gene essentiality overviews, mutation and fusion analysis, and cell line identification, with optional integration of user-provided statistics from CRISPR screens, proteomics, or transcriptomics analyses.
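The correlation-based idea above can be illustrated with a small sketch: given a DepMap-style gene-by-cell-line matrix of CRISPR gene effects, functionally linked genes are those whose effect profiles co-vary. The gene names and numbers below are made up for illustration:

```python
import numpy as np

# Toy gene-effect matrix (genes x cell lines); values mimic DepMap-style
# CRISPR gene effects (more negative = more essential). All illustrative.
genes = ["GENE_A", "GENE_B", "GENE_C"]
effects = np.array([
    [-1.0, -0.2, -0.9, -0.1],   # GENE_A
    [-0.9, -0.3, -1.0, -0.2],   # GENE_B: co-varies with GENE_A
    [ 0.1, -0.8,  0.0, -0.9],   # GENE_C: anti-correlated with GENE_A
])

def top_partner(query):
    """Return the gene whose effect profile correlates best with `query`,
    i.e. the strongest candidate functional partner."""
    i = genes.index(query)
    corr = np.corrcoef(effects)[i]
    corr[i] = -np.inf                      # ignore self-correlation
    return genes[int(np.argmax(corr))]

print(top_partner("GENE_A"))
```

Correlate applies this kind of profile correlation at scale, across the full DepMap matrix, rather than relying on curated interaction annotations.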

4
StrucTTY: An Interactive, Terminal-Native Protein Structure Viewer

Jang, L. S.-e.; Cha, S.; Steinegger, M.

2026-03-19 bioinformatics 10.64898/2026.03.17.712308 medRxiv
Top 0.1%
1.5%

Terminal-based workflows are central to large-scale structural biology, particularly in high-performance computing (HPC) environments and SSH sessions. Yet no existing tool enables real-time, interactive visualization of protein backbone structures directly within a text-only terminal. To address this gap, we present StrucTTY, a fully interactive, terminal-native protein structure viewer. StrucTTY is a single self-contained executable that loads multiple PDB and mmCIF files, normalizes three-dimensional coordinates, and renders protein structures as ASCII graphics. Users can rotate, translate, and zoom in on structures, adjust visualization modes, inspect chain-level features, and view secondary structure assignments. The tool supports simultaneous visualization of up to nine protein structures and can directly display structural alignments using Foldseek's output, enabling rapid comparative analysis in headless environments. The source code is available at https://github.com/steineggerlab/StrucTTY.

Key Messages:
- Real-time, interactive protein structure visualization directly within text-only terminals
- ASCII-based, depth-aware rendering of PDB and mmCIF backbone structures
- Multi-structure comparison with direct application of Foldseek alignment transformations
- Designed for headless workflows on remote servers and HPC systems

5
REBEL, Reproducible Environment Builder for Explicit Library resolution

Martelli, E.; Ratto, M. L.; Nuvolari, B.; Arigoni, M.; Tao, J.; Micocci, F. M. A.; Alessandri, L.

2026-04-07 bioinformatics 10.64898/2026.04.04.716498 medRxiv
Top 0.2%
0.9%

Background: Achieving FAIR-compliant computational research in bioinformatics is systematically undermined by two compounding challenges that existing tools leave unresolved: long-term reproducibility and accessibility. Standard package managers re-download dependencies from live repositories at every build, making environments vulnerable to library disappearance and version drift, and pinning a package version does not pin the versions of its transitive dependencies, causing divergences between builds performed at different points in time. Compounding this, packages from repositories such as CRAN, Bioconductor, and PyPI frequently omit critical system-level dependencies from their installation metadata, leaving users to manually discover which underlying library is missing or which version is required. Beyond these technical failures, constructing a truly reproducible environment demands expertise in containerization, making reproducibility in practice a privilege rather than a standard. Findings: We present REBEL (Reproducible Environment Builder for Explicit Library Resolution), a framework that addresses both challenges through three dependency inference heuristics: (i) Deep Inspection of source code, (ii) Fuzzy Matching against a manually curated knowledge base, and (iii) Conservative Dependency Locking. The resolved dependency stack is then archived into a self-contained local store, enabling offline and deterministic rebuilds at any future time. We compared the installation of 1,000 randomly sampled CRAN packages in isolated Docker containers using the standard package manager versus REBEL: REBEL resolved 149 of the 328 standard installation failures (45.4%). Moreover, through its DockerBuilder component, REBEL generates fully reproducible Docker images from a plain-text requirements file, making deterministic environment construction accessible without expertise in containerization.
Conclusions: REBEL provides a practical foundation for FAIR-compliant, long-term reproducible bioinformatics analyses, making deterministic environment construction accessible to researchers regardless of their technical background. REBEL is freely available at https://github.com/Rebel-Project-Core

6
FuzzyClusTeR: a web server for analysis of tandem and diffuse DNA repeat clusters with application to telomeric-like repeats

Aksenova, A. Y.; Zhuk, A. S.; Lada, A. G.; Sergeev, A. V.; Volkov, K. V.; Batagov, A.

2026-03-23 bioinformatics 10.64898/2026.03.19.712643 medRxiv
Top 0.3%
0.7%

DNA repeats constitute a large fraction of eukaryotic genomes and play important roles in genome stability and evolution. While tandem repeats such as microsatellites have been extensively studied, the genomic organization and potential functions of dispersed or loosely organized repeat patterns remain poorly understood. Here we present FuzzyClusTeR, a web server for the identification, visualization and enrichment analysis of DNA repeat clusters in genomic sequences. Using parameterized metrics, FuzzyClusTeR detects both classical tandem repeats and regions where related motifs occur in proximity without forming perfect tandem arrays, which we term diffuse (or fuzzy) repeat clusters. The server supports analysis of user-defined sequences as well as genome-scale datasets, including the T2T-CHM13 and GRCh38 human genome assemblies, and provides interactive visualization and statistical tools for assessing the genomic distribution of repetitive motifs and corresponding clusters. As a demonstration, we analyzed telomeric-like repeats in the T2T-CHM13v2.0 genome and identified families of diffuse clusters enriched in these motifs. Comparison with simulated sequences suggests that these clusters represent non-random genomic patterns with potential evolutionary and functional significance. FuzzyClusTeR enables systematic exploration of repeat clustering across genomic regions or entire genomes. It is available at https://utils.researchpark.ru/bio/fuzzycluster
[Graphical abstract (Figure 1) not shown]

7
geneslator: an R package for comprehensive gene identifier conversion and annotation

Cavallaro, G.; Micale, G.; Privitera, G. F.; Pulvirenti, A.; Forte, S.; Alaimo, S.

2026-04-01 bioinformatics 10.64898/2026.03.30.714723 medRxiv
Top 0.3%
0.5%

Motivation: High-throughput sequencing generates large gene lists, making data interpretation challenging. Accurate gene annotation and reliable conversion between identifiers (e.g., gene symbols, Ensembl GeneIDs, Entrez GeneIDs) are essential for integrating datasets, conducting functional analyses, and enabling cross-species comparisons. Existing tools and databases facilitate annotation but often suffer from inconsistencies, missing mappings, and fragmented workflows, limiting reproducibility and interpretability. Results: To address these limitations, we developed geneslator, an R package that unifies gene identifier conversion, ortholog mapping, and pathway annotation across eight model organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Saccharomyces cerevisiae, Caenorhabditis elegans, Arabidopsis thaliana). geneslator provides an up-to-date, precise, and coherent framework that preserves data integrity, enables cross-species analyses, and facilitates robust interpretation of gene function and regulation, outperforming state-of-the-art gene annotation tools. Availability: geneslator is available at https://github.com/knowmics-lab/geneslator. Contact: grete.privitera@unict.it

8
Variable Resolution Maps (VRM) in CCTBX and Phenix: Accounting For Local Resolution In cryoEM

Afonine, P.; Adams, P. D.; Urzhumtsev, A. G.

2026-03-28 bioinformatics 10.64898/2026.03.25.714315 medRxiv
Top 0.4%
0.5%

Calculation of density maps from atomic models is essential for structural studies using crystallography and electron cryo-microscopy (cryoEM). These maps serve various purposes, including atomic model building, refinement, visualization, and validation. However, accurately comparing model-calculated maps to experimental data poses challenges, particularly because the resolution of cryoEM experimental maps varies across the map. Traditional crystallography methods generate finite-resolution maps with uniform resolution throughout the unit cell volume, while most modern cryoEM software employs Gaussian-like functions to generate these maps, which does not adequately account for atomic model parameters and resolution. Recent work by Urzhumtsev & Lunin (2022, IUCr Journal, 9, 728-734) introduces a novel method for computing atomic model maps that incorporate local resolution and can be expressed as analytically differentiable functions of all atomic parameters. This approach enhances the accuracy of matching atomic models to experimental maps. In this paper, we detail the implementation of this method in CCTBX and Phenix. Synopsis: New tools implemented in CCTBX and Phenix allow the calculation of variable-resolution maps through a sum of atomic images expressed as analytic functions of all atomic parameters, along with their associated local resolution.

9
ATHILAfinder: a tool to detect ATHILA LTR retrotransposons in plant genomes

Bousios, A.; Primetis, E.

2026-03-22 bioinformatics 10.64898/2026.03.20.713144 medRxiv
Top 0.4%
0.4%

Motivation: The ATHILA lineage of LTR retrotransposons has colonised all branches of the plant tree of life. In Arabidopsis thaliana and A. lyrata, ATHILA elements have invaded centromeres, influencing their genetic and epigenetic organisation and driving satellite evolution. To assess the broader significance of ATHILA across plants, a computational pipeline is needed to identify ATHILA elements with high efficiency. Existing tools lack this ability because they are optimised for broad transposon classification at the expense of precise annotation of lower taxonomic levels. Results: We present ATHILAfinder, a pipeline for accurate and large-scale discovery of ATHILA elements. ATHILAfinder uses lineage-specific sequence motifs as seeds and additional filters to build de novo intact elements. Homology-based steps rescue intact ATHILA and identify soloLTRs. A detailed identity card includes coordinates, LTR identity, coding capacity, length, and other sequence features for every ATHILA. We validate ATHILAfinder in the A. thaliana Col-CEN assembly and five additional Brassicaceae species, covering four supertribes and ~30 million years of evolution. ATHILAfinder has very low false positive rates and outperforms widely used tools such as EDTA and the deep-learning-based Inpactor2 software for both recovery and precision of ATHILA. To demonstrate its usefulness, we generate insights into ATHILA dynamics across Brassicaceae. Outlook: Few computational pipelines target specific transposon lineages, yet such tools can empower their identification and downstream analyses. Our tailored approach can be adapted to other LTR retrotransposon lineages, offering new ways for high-resolution analysis of transposons.

10
TogoMCP: Natural Language Querying of Life-Science Knowledge Graphs via Schema-Guided LLMs and the Model Context Protocol

Kinjo, A. R.; Yamamoto, Y.; Bustamante-Larriet, S.; Labra-Gayo, J. E.; Fujisawa, T.

2026-03-23 bioinformatics 10.64898/2026.03.19.713030 medRxiv
Top 0.5%
0.4%

Querying the RDF Portal knowledge graph maintained by DBCLS, which aggregates more than 70 life-science databases, requires proficiency in both SPARQL and database-specific RDF schemas, placing this resource beyond the reach of most researchers. Large Language Models (LLMs) can, in principle, translate natural-language questions into executable SPARQL, but without schema-level context they frequently fabricate non-existent predicates or fail to resolve entity names to database-specific identifiers. We present TogoMCP, a system that recasts the LLM as a protocol-driven inference engine orchestrating specialized tools via the Model Context Protocol (MCP). Two mechanisms are essential to its design: (i) the MIE (Metadata-Interoperability-Exchange) file, a concise YAML document that dynamically supplies the LLM with each target database's structural and semantic context at query time; and (ii) a two-stage workflow separating entity resolution via external REST APIs from schema-guided SPARQL generation. On a benchmark of 50 biologically grounded questions spanning five types and 23 databases, TogoMCP achieved a large improvement over an unaided baseline (Cohen's d = 0.92, Wilcoxon p < 10^-6), with win rates exceeding 80% for question types with precise, verifiable answers. An ablation study identified MIE files as the single indispensable component: removing them reduced the effect to a non-significant level (d = 0.08), while a one-line instruction to load the relevant MIE file recovered the full benefit of an elaborate behavioral protocol. These results suggest a general design principle: concise, dynamically delivered schema context is more valuable than complex orchestration logic. Database URL: https://togomcp.rdfportal.org/
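The schema-context mechanism described above can be pictured as a small YAML file handed to the LLM at query time. The sketch below uses made-up field names and a made-up database entry; it illustrates the kind of structural and semantic context such a file could carry, not the actual TogoMCP MIE schema:

```yaml
# Hypothetical MIE-style schema-context file (illustrative only).
database: uniprot
endpoint: https://rdfportal.org/sparql
prefixes:
  up: http://purl.uniprot.org/core/
classes:
  - up:Protein
key_predicates:
  - up:organism          # links a protein to its taxon
  - up:recommendedName   # human-readable protein name
example_query: |
  SELECT ?protein WHERE { ?protein a up:Protein } LIMIT 10
```

Supplying predicates and an example query up front is what keeps the LLM from fabricating schema elements, the failure mode the abstract identifies.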

11
deluxpore: a Nextflow pipeline for demultiplexing Illumina dual-indexed Nanopore libraries

Arnaiz del Pozo, C.; Sanchis-Lopez, C.; Huerta-Cepas, J.

2026-03-30 bioinformatics 10.64898/2026.03.27.714410 medRxiv
Top 0.5%
0.4%

Summary: The combination of target capture metagenomics and long-read sequencing represents a powerful approach for the characterisation of rare microbial taxa and their functional genes. However, standard Nanopore library preparations are incompatible with established capture protocols. A possible workaround is the preparation of Illumina libraries prior to ONT sequencing. Currently, this hybrid approach is hindered by a lack of specialised demultiplexing software capable of handling residual adapter fragments, Nanopore's higher error rates, and positional variability. Here, we present deluxpore: a Nextflow pipeline that demultiplexes Nanopore reads from Illumina dual-indexed libraries (NEBNext and Nextera) using BLAST alignment and Levenshtein distance matching. Extensive benchmarking across 18 replicates validates the viability and precision of this hybrid indexing approach and demonstrates that accurate demultiplexing requires minimum Q20 data quality and strategic index selection. Unique index-to-sample designs achieved 91.7% sample recovery at Q20 versus 46.1% for combinatorial approaches. We also identified high-crosstalk index pairs within NEBNext Primer Set A and provide an optimized 8-sample configuration achieving ~95% accuracy at Q20. deluxpore enables reliable, automated demultiplexing for hybrid capture-long-read sequencing workflows. Availability and implementation: deluxpore is implemented in Nextflow, Python, and Bash under the GNU GPL v3.0. Source code, documentation, and benchmarking workflows are available at https://github.com/compgenomicslab/deluxpore and https://github.com/compgenomicslab/deluxpore-benchmarking.
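The Levenshtein-distance matching step mentioned above (one half of the approach; the BLAST alignment step is not shown) can be sketched as follows. The index sequences and the maximum-distance threshold are illustrative, not deluxpore's actual defaults:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[-1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical 8-bp indexes; real NEBNext/Nextera index sets differ.
INDEXES = {"sample1": "ATTACTCG", "sample2": "TCCGGAGA", "sample3": "CGCTCATT"}

def assign(read_index, max_dist=2):
    """Assign an observed (possibly error-containing) index to the closest
    known sample, rejecting distant or ambiguous (tied) matches."""
    scored = sorted((levenshtein(read_index, idx), s)
                    for s, idx in INDEXES.items())
    (d1, best), (d2, _) = scored[0], scored[1]
    if d1 > max_dist or d1 == d2:
        return None                     # leave the read unassigned
    return best

print(assign("ATTACTCA"))   # one substitution away from sample1's index
```

Tolerating a small edit distance is what accommodates Nanopore's higher error rates, while the tie rejection guards against the index crosstalk the benchmarking identified.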

12
Helicase: Vectorized parsing and bitpacking of genomic sequences

Martayan, I.; Lobet, L.; Marchet, C.; Paperman, C.

2026-03-22 bioinformatics 10.64898/2026.03.19.712912 medRxiv
Top 0.5%
0.4%

Modern sequencing pipelines routinely produce billions of reads, yet the dominant storage formats (FASTQ and FASTA) are text-based and sequential, making high-throughput parsing a persistent bottleneck in bioinformatics. Their regular, line-oriented structure makes them well suited to SIMD vectorization, but existing libraries do not fully exploit it. We present vectorized algorithms for high-throughput FASTA/Q parsing, with on-the-fly handling of non-ACTG characters and built-in bitpacking of DNA sequences into multiple compact representations. The parsing logic is expressed as a finite state machine, compiled into efficient SIMD programs targeting both x86 and ARM CPUs. These algorithms are implemented in Helicase, a Rust library exposing a tunable interface that retrieves only caller-requested fields, minimizing unnecessary work. Exhaustive benchmarks across a wide range of CPUs show that Helicase meets or exceeds the throughput of all evaluated state-of-the-art libraries, making it the fastest general-purpose FASTA/Q parser to our knowledge. Availability: https://github.com/imartayan/helicase.
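The bitpacking idea is easy to illustrate in scalar form (Helicase itself implements it with SIMD in Rust): each of A, C, G, T maps to two bits, so a k-base sequence fits in 2k bits. This sketch is a toy illustration of the representation, not Helicase's code:

```python
# 2-bit encoding of the DNA alphabet (illustrative encoding order).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def pack(seq):
    """Pack a DNA string into a single integer, two bits per base."""
    word = 0
    for b in seq:
        word = (word << 2) | CODE[b]
    return word

def unpack(word, n):
    """Recover an n-base string from its packed form."""
    return "".join(BASE[(word >> (2 * (n - 1 - i))) & 3] for i in range(n))

packed = pack("GATTACA")
print(packed, unpack(packed, 7))
```

The 4x size reduction over ASCII is the point; handling non-ACTG characters (e.g. N) requires an escape mechanism, which the library's "multiple compact representations" presumably address.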

13
EMITS: expectation-maximization abundance estimation for fungal ITS communities from long-read sequencing

O'Brien, A.; Lagos, C.; Fernandez, K.; Ojeda, B.; Parada, P.

2026-04-02 bioinformatics 10.64898/2026.03.31.715662 medRxiv
Top 0.5%
0.4%

As long-read amplicon sequencing becomes routine for fungal metabarcoding, species-level abundance estimation from ITS amplicons remains limited by naive best-hit classification, which misattributes reads among closely related species sharing similar ITS sequences and fragments abundance across redundant database entries. Here we present EMITS, a Rust-based tool that applies expectation-maximization (EM) to iteratively resolve ambiguous read-to-reference mappings from minimap2 alignments against the UNITE database, producing probabilistic species-level abundance estimates. EMITS includes platform-specific presets for Oxford Nanopore and PacBio chemistries and performs taxonomic aggregation across UNITE accessions. We validated EMITS using three complementary approaches: controlled simulations with tunable alignment noise, an Oxford Nanopore mock community of 10 fungal species with known composition, and a synthetic community of 21 species derived from UNITE reference sequences. In simulations, EM reduced L1 error by 80-92% compared to naive counting under realistic noise conditions. On the ONT mock community, EM correctly resolved within-genus species assignments where naive counting misattributed reads (e.g., Trichophyton mentagrophytes vs. T. simii; Penicillium species) and consolidated abundance across redundant database accessions. On the synthetic community, EM reduced false positive abundance by 54% and improved overall accuracy by 13.4%. Together with ITSxRust [O'Brien et al., 2026] for upstream ITS extraction, EMITS provides a complete high-performance pipeline for long-read fungal amplicon profiling.
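The expectation-maximization step at the core of this approach can be sketched in a few lines: each ambiguously mapped read is fractionally assigned to its candidate species in proportion to the current abundance estimates (E-step), and the estimates are then re-normalized from those fractional counts (M-step). The species labels and read mappings below are made up; EMITS itself works from minimap2 alignments:

```python
def em_abundance(read_hits, n_iter=100):
    """read_hits: one set of candidate species per read.
    Returns estimated relative abundances as a dict."""
    species = sorted({s for hits in read_hits for s in hits})
    theta = {s: 1.0 / len(species) for s in species}   # uniform start
    for _ in range(n_iter):
        counts = {s: 0.0 for s in species}
        for hits in read_hits:
            z = sum(theta[s] for s in hits)            # E-step: split each
            for s in hits:                             # read by abundance
                counts[s] += theta[s] / z
        total = sum(counts.values())                   # M-step: renormalize
        theta = {s: c / total for s, c in counts.items()}
    return theta

# Two reads unique to A, two ambiguous between A and B, one between B and C.
reads = [{"A"}, {"A"}, {"A", "B"}, {"A", "B"}, {"B", "C"}]
est = em_abundance(reads)
print(est)
```

Because species A has unique reads, EM pulls the ambiguous A/B reads toward A, while C, which has no unique support, is driven toward zero; this is exactly how the method avoids the misattribution that naive best-hit counting suffers from.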

14
Breaking the Extraction Bottleneck: A Single AI Agent Achieves Statistical Equivalence with Human-Extracted Meta-Analysis Data Across Five Agricultural Datasets

Halpern, M.

2026-03-23 bioinformatics 10.64898/2026.02.17.706322 medRxiv
Top 0.5%
0.4%

Background: Data extraction is the primary bottleneck in meta-analysis, consuming weeks of researcher time with single-extractor error rates of 17.7%. Existing LLM-based systems achieve only 26-36% accuracy on continuous outcomes, and no study has validated AI-extracted continuous data against multiple independent datasets using formal equivalence testing. Methods: A single AI agent (Claude Opus 4.6) extracted treatment means, control means, sample sizes, and variance measures from source PDFs across five published agricultural meta-analyses spanning zinc biofortification, biostimulant efficacy, biochar amendments, predator biocontrol, and elevated CO2 effects on plant mineral nutrition. Observations were matched to reference standards using an LLM-driven alignment method. Validation employed proportional TOST equivalence testing, ICC(3,1), Bland-Altman analysis, and source-type stratification. Results: Across five datasets, the agent produced 1,149 matched observations from 136 papers. Pearson correlations ranged from 0.984 to 0.999. Proportional TOST confirmed statistical equivalence for all five datasets (all p < 0.05). Table-sourced observations achieved 5.5x lower median error than figure-sourced observations. Aggregate effects were reproduced within 0.01-1.61 pp of published values. Independent duplicate runs confirmed extraction stability (within 0.09-0.23 pp). Conclusions: A single AI agent achieves statistical equivalence with human-extracted meta-analysis data across five independent agricultural datasets. The approach reduces extraction cost by approximately one to two orders of magnitude while maintaining accuracy sufficient for aggregate meta-analytic pooling.

Highlights
What is already known:
- Data extraction is the primary bottleneck in meta-analysis, with single-extractor error rates of 17.7%
- Existing LLM-based extraction systems achieve only 26-36% accuracy on continuous outcomes
- No study has validated AI extraction against multiple independent datasets using formal equivalence testing
What is new:
- A single AI agent achieves statistical equivalence with human-extracted data across five agricultural meta-analyses (1,149 observations, 136 papers)
- LLM-driven alignment resolves the previously underappreciated bottleneck of moderator matching, improving correlations from 0.377-0.812 to 0.984-0.997 without changing extracted values
- Table-sourced observations achieve 5.5x lower error than figure-sourced data
Potential impact for RSM readers:
- Provides a validated, reproducible workflow for AI-assisted data extraction in meta-analysis
- Demonstrates that most apparent "extraction error" in validation studies is actually alignment error
- Offers practical quality signals (source-type labeling) for downstream meta-analysts

15
Scalable computation of ultrabubbles in pangenomes by orienting bidirected graphs

Harviainen, J.; Sena, F.; Moumard, C.; Politov, A.; Schmidt, S.; Tomescu, A. I.

2026-03-31 bioinformatics 10.64898/2026.03.28.714704 medRxiv
Top 0.5%
0.3%

Motivation: Pangenome graphs are increasingly used in bioinformatics, ranging from environmental surveillance and crop improvement to the construction of population-scale human pangenomes. As these graphs grow in size, methods that scale efficiently become essential. A central task in pangenome analysis is the discovery of variation structures. In directed graphs, the most widely studied such structures, superbubbles, can be identified in linear time. Their canonical generalization to bidirected graphs, ultrabubbles, more accurately models DNA reverse complementarity. However, existing ultrabubble algorithms are quadratic in the worst case. Results: We show that all ultrabubbles in a bidirected graph containing at least one tip or one cutvertex (a common property of pangenome graphs) can be computed in linear time. Our key contribution is a new linear-time orientation algorithm that transforms such a bidirected graph into a directed graph that is, in practice, of the same size. Orientation conflicts are resolved by introducing auxiliary source or sink vertices. We prove that ultrabubbles in the original bidirected graph correspond to weak superbubbles in the resulting directed graph, enabling the use of existing linear-time algorithms. Our approach achieves speedups of up to 25x over the ultrabubble implementation in vg, and of more than 200x over BubbleGun, enabling scalable pangenome analyses. For example, on the v2.0 pangenome graph constructed by the Human Pangenome Reference Consortium from 232 individuals, after reading the input, our method completes in under 3 minutes, while vg requires more than one hour and four times more RAM. Availability: Our method is implemented in the BubbleFinder tool (github.com/algbio/BubbleFinder), via the new ultrabubbles subcommand. Contact: alexandru.tomescu@helsinki.fi

16
BCAR: A fast and general barcode-sequence mapper for correcting sequencing errors

Andrews, B.; Ranganathan, R.

2026-03-31 bioinformatics 10.64898/2026.03.27.714882 medRxiv
Top 0.5%
0.3%

Motivation: DNA barcodes are commonly used as a tool to distinguish genuine mutations from sequencing errors in sequencing-based assays. In the presence of indel errors, utilizing barcodes requires accurate alignment of the raw reads to distinguish genuine indels from indel errors. Existing strategies to do this generally rely on aligners built for homology comparison and do not fully utilize quality scores. We reasoned that developing an aligner purpose-built for error correction could yield higher-quality barcode-sequence maps. Results: Here, we present BCAR, a fast barcode-sequence mapper for correcting sequencing errors. BCAR considers all of the evidence for each base call at each position, both during alignment and during final consensus generation. BCAR creates high-accuracy barcode-sequence maps from simulated reads across a broad range of error rates and read lengths, outperforming existing methods. We apply BCAR to two experimental datasets, where it generates high-quality barcode-sequence maps. Availability and implementation: BCAR source code, documentation, and test data are available from https://github.com/dry-brews/BCAR
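The evidence-pooling idea (weighting every base call by its quality when forming a consensus) can be sketched as follows. This is a toy per-column vote over pre-aligned reads, not BCAR's actual alignment-aware algorithm:

```python
from collections import defaultdict

def consensus(reads):
    """reads: list of (sequence, per-base error probabilities), all the
    same length (i.e. already aligned). Each base call votes with weight
    1 - P(error); the heaviest base wins each column."""
    length = len(reads[0][0])
    out = []
    for i in range(length):
        weight = defaultdict(float)
        for seq, perr in reads:
            weight[seq[i]] += 1.0 - perr[i]
        out.append(max(weight, key=weight.get))
    return "".join(out)

reads = [
    ("ACGT", [0.01, 0.01, 0.01, 0.01]),
    ("ACAT", [0.01, 0.01, 0.60, 0.01]),   # low-confidence call at position 2
    ("ACGT", [0.01, 0.01, 0.01, 0.01]),
]
print(consensus(reads))
```

A quality-aware vote like this lets one confidently called base outweigh a noisy disagreement, which is why pooling all evidence per position beats simple majority counting; handling indels requires the alignment step the abstract describes.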

17
ECHO: a nanopore sequencing-based workflow for (epi)genetic profiling of the human repeatome

Poggiali, B.; Putzeys, L.; Andersen, J. D.; Vidaki, A.

2026-03-20 bioinformatics 10.64898/2026.03.18.712618 medRxiv
Top 0.6%
0.3%

Summary: The human genome is dominated by repetitive DNA, whose genetic and epigenetic variation plays a key role in gene regulation, genome stability, and disease. Recent advances in long-read sequencing now enable large-scale, haplotype-resolved, and DNA methylation-informative analysis of the human genome, including previously inaccessible complex and repetitive regions. However, comprehensive, simultaneous characterisation of the "human repeatome" remains challenging, largely due to the lack of comprehensive tools integrated in a single pipeline that can capture the full spectrum of variation across diverse types of DNA repeats. Here, we present ECHO, a user-friendly, Snakemake-based pipeline for the "(Epi)genomic Characterisation of Human Repetitive Elements using Oxford Nanopore Sequencing". ECHO provides a reproducible and scalable framework for end-to-end analysis of whole-genome nanopore sequencing data, enabling integrative as well as tailored (epi)genetic analyses of the human repeatome. Availability and implementation: ECHO is freely available on GitHub: https://github.com/leenput/ECHO-pipeline, with an archived version on Zenodo: https://zenodo.org/records/19068468. Contact: athina.vidaki@mumc.nl; athina.vidaki@maastrichtuniversity.nl

18
Visualizing and sonifying neurodata (ViSoND) for enhanced observation

Blankenship, L.; Sterrett, S. C.; Martins, D. M.; Findley, T. M.; Abe, E. T. T.; Parker, P. R. L.; Niell, C.; Smear, M. C.

2026-03-24 neuroscience 10.64898/2026.03.21.713430 medRxiv
Top 0.6%
0.3%

Neuroscience needs observation. Observation lets us evaluate data quality, judge whether models are biologically realistic, and generate new hypotheses. However, high-dimensional behavioral and neural data are too complex to be easily displayed and eye-tested. Computational methods can reduce the dimensionality of data and reveal statistically robust dynamical structure, but often yield results that are difficult to relate back to the underlying biology. In addition, the choice of which parameters to quantify may not capture unexpectedly relevant aspects of the data. To supplement quantification with enhanced qualitative observation, we developed Visualization and Sonification of NeuroData (ViSoND), an open-source approach for displaying multiple data streams using video and sonification. Sonification is nothing new to neuroscience: scientists have sonified their physiological preparations since Lord Adrian's earliest recordings. We extend this tradition by mapping multiple physiological data streams to musical notes using MIDI. Synchronizing MIDI to video makes it possible to watch an animal's movement while listening to physiological signals such as action potentials. Here we provide two demonstrations of this approach. First, we used ViSoND to interpret behavioral structure revealed by a computational model trained on the breathing rhythms of freely behaving mice. Second, ViSoND revealed patterns of neural activity in mouse visual cortex corresponding to eye blinks, events that were previously filtered out of analysis. These use cases show that ViSoND can supplement quantitative rigor with observational interpretability. Additionally, ViSoND provides an accessible way to display data that may broaden the audience for communication of neuroscientific findings.
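The mapping of a physiological signal to MIDI notes can be sketched in a few lines. The note range and min-max scaling below are assumptions for illustration, not ViSoND's actual parameters.

```python
def to_midi_notes(signal, lo=36, hi=84):
    """Map a physiological signal to MIDI note numbers by min-max
    scaling each sample into the note range [lo, hi] (C2..C6 here).
    Hypothetical sketch of a signal-to-MIDI sonification step."""
    smin, smax = min(signal), max(signal)
    span = (smax - smin) or 1.0      # avoid division by zero on flat signals
    return [round(lo + (x - smin) / span * (hi - lo)) for x in signal]
```

Each resulting note number (0..127 in MIDI) can then be emitted as a note-on event time-locked to the corresponding video frame.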

19
NeoDBS: Open-Source Platform for Visualization and Analysis of Electrophysiological Recordings from Deep Brain Stimulation Systems

Rodrigues, L.; Ferreira, A.; Pereira, I.; Moreira, R.; Jacinto, L.

2026-03-30 bioengineering 10.64898/2026.03.27.714691 medRxiv
Top 0.6%
0.3%

Optimization of deep brain stimulation (DBS) therapy for neurological and neuropsychiatric disorders depends on objective quantitative biomarkers that can guide stimulation parameter adjustments. With the recent introduction of new-generation DBS systems capable of simultaneously delivering stimulation and recording local field potentials (LFP), there is increasing demand for platforms that enable efficient visualization and analysis of these signals for electrophysiological biomarker identification. To address the limitations of currently available toolboxes, which require advanced signal processing skills and rely on proprietary software, we present NeoDBS, an open-source Python platform designed for ingestion, advanced visualization, and processing of LFP signals from DBS systems through an easy-to-use graphical interface. NeoDBS is a user-centered platform that offers predefined analysis pipelines with the aim of facilitating electrophysiological biomarker investigation for DBS across different brain disorders. Custom analysis pipelines are also available, letting users adapt the signal analysis tools to their research needs. Critical functionalities for longitudinal biomarker research are featured in NeoDBS, such as batch file processing and event-locked analysis for in-clinic and at-home recordings. This combination of accessibility, user experience, and advanced signal processing tools makes NeoDBS an environment that enables easy and fast electrophysiological biomarker research for DBS across patients, sessions, and stimulation parameters.

20
Benchmarking Agentic Bioinformatics Systems for Complex Protein-Set Retrieval: A Coccolithophore Calcification Case Study

Zhang, X.

2026-04-02 bioinformatics 10.64898/2026.03.28.715041 medRxiv
Top 0.6%
0.3%

Large language model agents are increasingly used for bioinformatics tasks that require external databases, tool use, and long multi-step retrieval workflows. However, practical evaluation of these systems remains limited, especially for prompts whose target set is both large and biologically heterogeneous. Here, I benchmarked three agent systems on the same difficult retrieval task: downloading coccolithophore calcification-related proteins from UniProt across six mechanistically distinct categories, while producing category-separated FASTA files and supporting evidence. The compared systems were Codex app agents extended with Claude Scientific Skills, Biomni Lab online, and DeerFlow 2 with default skills only. Outputs were normalized at the UniProt accession level and compared category by category using overlap analysis, Venn decomposition, and a heuristic relevance assessment of each subset relative to the benchmark prompt. Across the six shared categories, Codex retrieved 2,118 proteins, DeerFlow 6,255, and Biomni 8,752 in a single run. Codex showed the best balance between sensitivity and specificity: 92.4% of its proteins fell into subsets labeled high relevance and the remaining 7.6% into medium relevance. DeerFlow was substantially more exhaustive, but 43.8% of its proteins fell into low or low-medium relevance subsets. Biomni produced the largest sets, yet 69.5% of its proteins fell into low or low-medium relevance subsets, mainly due to broad expansion into generic calcium sensors, kinases, transcription factors, and poorly specific domain families. Category-specific analysis showed that Codex was the strongest primary source for inorganic carbon transport, calcium and pH regulation, vesicle trafficking, and signaling, whereas DeerFlow contributed valuable complementary matrix and polysaccharide candidates.
A second run for each system also separated them strongly by repeatability: Codex had the highest within-system stability (mean category Jaccard 0.982; micro-Jaccard 0.974), DeerFlow was intermediate (0.795; 0.571), and Biomni was least stable (0.412; 0.319). These results suggest that for complex protein-family retrieval tasks, agent quality depends less on raw output volume than on prompt decomposition, taxonomic scoping, exact query generation, provenance-rich export artifacts, and repeated-run stability.
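The stability metrics named above can be sketched as follows. The handling of empty categories and the exact aggregation are assumptions for illustration, not taken from the preprint.

```python
def jaccard(a, b):
    """Jaccard index of two collections (1.0 for two empty sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def repeatability(run1, run2):
    """Mean per-category Jaccard and micro-Jaccard between two runs,
    where each run maps category -> set of UniProt accessions."""
    cats = set(run1) | set(run2)
    per_cat = {c: jaccard(run1.get(c, set()), run2.get(c, set())) for c in cats}
    mean_cat = sum(per_cat.values()) / len(cats)
    # Micro-Jaccard pools all accessions across categories before comparing.
    all1 = set().union(*run1.values()) if run1 else set()
    all2 = set().union(*run2.values()) if run2 else set()
    return mean_cat, jaccard(all1, all2)
```

The two views diverge when categories differ in size: the per-category mean weights every category equally, while the micro-Jaccard is dominated by the largest sets.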